A retrospective study of mortality for perioperative cardiac arrests toward a personalized treatment

Perioperative cardiac arrest (POCA) is associated with a high mortality rate. This work aimed to study its prognostic factors for risk mitigation by means of care management and planning. A database of 380,919 surgeries was reviewed, and 150 POCAs were curated. The main outcome was mortality prior to hospital discharge. Patient demographic, medical history, and clinical characteristics (anesthesia and surgery) were the main features. Six machine learning (ML) algorithms, including LR, SVC, RF, GBM, AdaBoost, and VotingClassifier, were explored. The last algorithm was an ensemble of the first five algorithms. k-fold cross-validation and bootstrapping minimized the prediction bias and variance, respectively. Explainers (SHAP and LIME) were used to interpret the predictions. The ensemble provided the most accurate and robust predictions (AUC = 0.90 [95% CI, 0.78–0.98]) across various age groups. The risk factors were identified by order of importance. Surprisingly, the comorbidity of hypertension was found to have a protective effect on survival, which was reported by a recent study for the first time to our knowledge. The validated ensemble classifier in aid of the explainers improved the predictive differentiation, thereby deepening our understanding of POCA prognostication. It offers a holistic model-based approach for personalized anesthesia and surgical treatment.

www.nature.com/scientificreports/ rates varying between 20 and 60% [3][4][5][6][7] . Accurately predicting survival and promptly making correct decisions pose a huge challenge for anesthesiologists and clinicians under uncertain and dynamic environments. The incidence and causes of cardiac arrests related to anesthesia have been studied over the last two decades 1 . Nevertheless, understanding of POCA and controlling the related risk factors are still in their infancy. Several studies [8][9][10][11] analyzed individual variables associated with the survival of cardiac arrests by meta-analysis. The main issue with this approach is that effects are often multivariate rather than univariate, making results prone to bias. Multiple disease severity scores predicting survival have been developed as a tool for risk stratification after cardiac arrest [12][13][14][15][16][17][18][19][20][21] ; however, as they usually have suboptimal predictive accuracy for a specific patient population, they should be cautiously extrapolated and applied to an individual patient in hospital.
Recently, machine learning has emerged as an effective approach to integrate multiple quantitative variables to improve accuracy of incidence predictions in medicine, with the potential to dramatically improve healthcare delivery [22][23][24][25][26] . Specifically, in the fields of anesthesiology and cardiac arrest research, it has recently been shown that ML is a promising method for a more comprehensive understanding of the risk factors and a supporting tool for healthcare improvement 27-31 . Therefore, this study reported all cardiac arrests that occurred in a surgical population pre-intra-post anesthesia in one of the largest Chinese tertiary hospitals during an 8-year period, and examined causes of mortality with ML in addition to the univariate method of ANOVA. After validating the ML models, they can be used to identify the mortality risk factors and predict survival outcome of an individual patient. The data bring more information about anesthesia and surgery in addition to patients' demographic characteristics, and ML models may help offer more potential to understand and manage the procedure than traditional resuscitation algorithms 33 . This study may provide a basis for designing model-based prediction and care management strategies of anesthesia/surgery to improve the prognosis and survival of POCA. The aim of this retrospective study was to identify factors of POCA and help to improvement in the prevention and management of POCA. This work will open up an avenue for a personalized anesthesia and surgery strategy, with a better treatment and a higher survival rate attained.

Methods
Data collection. This retrospective study was approved by Human Research Ethics Committee of the First Affiliated Hospital of Zhengzhou University (number: KY-2021-0084). The study was registered at the Chinese Clinical Trial Registry (ChiCTR2100051737). The requirement for obtaining a written informed consent from patients was waived due to the retrospective nature of the study. The study was performed in accordance with the principles of the Declaration of Helsinki. Electronic medical records of 380,919 patients who had undergone a surgical procedure between December 2012 and June 2020 were reviewed by three of the authors (Huijie Shang, Qinjun Chu, Jin Guo) in July 2020. Brain-dead organ donors and babies undergoing cardiac compressions due to arrest immediately after caesarean section were excluded from the analysis. Patients on cardiopulmonary bypass or extracorporeal membrane oxygenation were also excluded because cardiac compressions are not needed in such situations and the use of such devices can significantly affect clinical outcomes. The "perioperative" period was defined as the time from entering the operating room to exiting the postanesthesia care unit. Cardiac arrest was defined as any condition that required performing chest compressions or defibrillation. From the anesthetic records, 150 patients who suffered POCA with a full record were selected for this study. Data were classified into patient demographic characteristics, operation-related variables, and cardiac arrest-related variables. The patients' demographic characteristics included gender, age, body mass index (BMI), comorbidities, emergency, trauma, and five-category physical status by ASA PS. The operation-related variables comprised anesthetic type, surgical type, operative position, the amount of blood lost, blood transfused anti-arrhythmic drug use, and continuous infusion of vasoactive drugs. Cardiac arrest-related variables included arrest cause, arrest time, whether or not defibrillation was done, and duration of CPR. The primary outcome was in-hospital mortality of POCA patients until hospital discharge. Statistical analysis. The patients' characteristics were compared by mortality outcomes. The statistical methods used in this work were the same as those used in a previous study 31 . The analyses were done in R programming language, version 3.6.1. The code has been uploaded (refer to Supplementary Document 1.2 in Online Supplementary Materials).
Machine learning models. Six algorithms were explored. Five of them, namely LR, SVC, RF, GBM, and AdaBoost, are the most commonly used algorithms for binary classification problems in medicine 37 . An additional one is an ensemble approach, which is realized through a voting classifier aggregating the prediction of multiple classifiers. Therefore, we designed VotingClassifier, which combines the predictions of the aforementioned five models to improve prediction robustness.
Notably, it was quite a challenge to obtain a robust and accurate ML model given that the data were scarce because POCA is a very rare event. A thorough effort was made in this work as follows.
1. A five-fold cross-validation resampling procedure was used to evaluate the models on the limited training data to reduce the prediction bias. The bootstrapping method was further leveraged to minimize the potentially large prediction variance. For each fold, we extracted the true positive rate and false positive rate and calculated the area under the receiver operating characteristic curve (AUC), the mean of which was used as the optimization metric. Based on this series of results, we obtained a confidence interval of AUC to show the robustness of an ML classifier. 2. Grid and random hyperparameter search were used to search for optimal hyperparameters.

Results
Patients' characteristics and statistical analysis. There were 380,919 patients who had undergone a surgical procedure, with 163 POCAs, of which 13 were excluded and 150 included ( Supplementary Fig. 2). As shown in Table 1, 150 POCA patients were investigated. A total of 81 patients died prior to hospital discharge, resulting in a survival rate of 46%. The average age was 49.4 (± 18.5) years, with 96 (64.0%) patients being male and 73 (48.7%) being emergency cases. A total of 145 (96.7%) patients underwent general anesthesia, and 91 (60.7%) patients were in ASA PS III-V. Fourteen patients experienced cardiac arrest during induction, while 13 patients experienced cardiac arrest during intubation. The majority of cardiac arrests (N = 102; 68.0%) occurred during surgery. The common causes of POCA were preoperative complications (N = 34; 22.7%), related to anesthesia (N = 23; 15.3%), and surgical complications (N = 41; 27.3%).
The following variables were significantly different between the survivor and non-survivor groups (P < 0.05): gender, surgical type, emergency, operative position, ASA PS, hemorrhage, blood transfusion, epinephrine, and CPR. Accordingly, the favorable categories for survival were female sex, throat (or other) surgery, non-emergency, and ASA PS I-II. In contrast, there was a higher probability of mortality in male individuals, neurosurgery, emergency, supine operative position, massive hemorrhage and blood transfusion, and ASA PS V. Other variables, such as age, BMI, trauma, arrest time, use of defibrillation, and arrest cause, were not significantly different between the two groups. There was no evidence that the administration of drugs except epinephrine was directly associated with survival or death.
In addition, comorbidities and medical history were generally not strongly associated with mortality. The observed difference in the presence of hypertension (P = 0.146) between the survival and death groups was 14.6%, which indicates that hypertension might be a remarkably influential comorbidity for further exploration. ML models. The 150 patients were split into two subgroups in a gender-stratified manner, i.e., 112 (75%) and 38 (25%) for training and testing of the ML models, respectively. To preserve the same gender proportions of patients in each subgroup as in the total patients, the data were split in a gender-stratified manner. The predicting outcome was the probability of mortality. Figure 1 shows the Receiver Operating Characteristic (ROC) curves generated with the test data by the six ML models, including LR, SVC, RF, GBM, AdaBoost, and VotingClassifier, and their AUCs were 0.84, 0.87, 0.91, 0.90, 0.87, and 0.90, respectively.
In binary classification, the most basic metric/bench-mark is the confusion matrix given that "accuracy," "precision," "recall," "f1-score," "ROC," and "AUC" all stem from the confusion matrix 38 . We used these multiperspective performance measures to fairly judge the predictive models.
As shown in Table 2 We further considered two aspects to analyze the prediction performance of the models. One was probability curves for each ML model (Fig. 2); another was model comparisons with respect to mortality estimation across age groups (Fig. 3).
As shown in Fig. 2, the LR estimated a higher probability of survival. Corresponding to a threshold of 50%, the false negative (FN) of mortality was 6, and the false positive (FP) was 4; this means that six patients who died were wrongly classified into survivors, while four patients who survived were wrongly predicted to have died. For the SVC model, FN was 5 and FP was 3, with low variance in probability attributed to all survivors. For the RF and the GBM, the misclassified values were smaller, i.e., FN = 4 and FP = 3. The VotingClassifer brought about the smallest misclassifications, with FN = 3 and FP = 3. In addition, the GBM and VotingClassifier demonstrated significant separation of the dead individuals from the survivors, with lower overlap between the two groups.
As shown in Supplementary Fig. 1, the VotingClassifier was the best classifier for age groups " < 12 years" and " ≥ 65 years", and probably the second best for age groups "12-40 years" and "40-65 years" (outmatched only by the RF). The GBM model tended to significantly overestimate mortality in age groups " < 12 years" and " ≥ 65 years".
To summarize, the ensemble ML model (VotingClassifier) outperformed all of the other classifiers by making better predictions and achieving better performance than any single contributing model. Moreover, it reduced the spread or dispersion of the predictions with higher robustness.
Model explainability. First, we applied the SHAP to explain predictions on the test data by the VotingClassifier. The SHAP summary, combining feature importance with feature effects, was visualized with violin plots to present the distribution of Shapley values (Fig. 3). The position on the y-axis was determined by the feature and that on the x-axis by the Shapley value.
All effects described the model behavior and were not necessarily causal in the real world, which was why we used the term "association" rather than "causation" in the above statements 32 .   www.nature.com/scientificreports/ Second, we interpreted the VotingClassifier with the LIME explainer, particularly to explore misclassification of prediction. Four typical cases corresponding to the four quadrants in a confusion matrix (TP, TN, FP, FN), were compared. The top 10 features are presented in Fig. 4, with the weight of each feature represented in either green or red depending on whether it favored survival or death, respectively.
In one FP case (Fig. 4c), the predicted probability of mortality was 62%, but the patient survived. The key unfavorable features for survival were CPR 30-60 min (~ 19% impact) and hemorrhage ≥ 800 mL (~ 12% impact), to which the misclassification could be attributed. In one FN case (Fig. 4d), the predicted probability of mortality was 39%. However, the patient died. The most survival-favorable feature was CPR ≤ 30 min (~ − 28% impact), www.nature.com/scientificreports/ which probably overpowered other survival-unfavorable features, such as emergency, thereby leading to this misclassification.

Discussion
Given the fact that POCA is a quite rare incidence, it is hard to access to an abundant amount of data. Moreover, there are few papers about cardiac arrests in China. The present study investigated POCA in 380,919 patients at a Chinese tertiary hospital. Overall, the incidence of POCA was 3.9 per 10,000 surgical procedures with a mortality of 54% prior to hospital discharge. All of the ML models used in this study, except for the LR, are "black-box" algorithms, which provide great accuracy at the cost of low interpretability 33 . There are multiple dangers of a decision made by ML without opening the black box, as follows: (1) It is usually hard to explain the predictions to clinicians, which is a barrier to the adoption of ML for high stakes decisions 35 ; (2) More and more concerns or regulations specific to ML have been emerging on interpretability and its predictive reasoning (for example, the EU General Data Protection Regulation).
First, a global model-agnostic method of permutation feature importance was employed in this work. The results were not shown in this article because some evident drawbacks of this method were found: (1) shuffling the feature added randomness and the results usually varied greatly; (2) some features were inherently correlated, and this method was very biased by unrealistic data instances.
In this study, SHAP and LIME were demonstrated to be two competent local model-agnostic methods in the model explainability. Instead of calibrating global feature contributions, these two methods train local surrogate models to explain individual predictions with more solid insights generated, such as how to rank a feature by importance with a favorable or unfavorable impact value on prediction outcome. We obtained contrastive explanations with the two explainers, particularly to explore misclassification, making the ensemble ML model more transparent and shedding light on their applications in clinical decision-making. Explanations can be used to interrogate and rectify the ensemble model when such a misclassification surfaces.
Our study has several limitations. First, although the ultimately validated ensemble model was robust and accurate, the size of data used was still relatively small. Specifically, there were only eight patients younger than Figure 3. SHAP importance plots of the mortality and risk factors for the ensemble ML model (VotingClassifier). The features are ranked by importance. Each row represents the impact of a feature on the outcome of mortality, with higher SHAP values indicating higher likelihood of a positive outcome. For a binary feature, like gender, "male" → "1" is shown in red while "female" → "0" is shown in blue. For the detailed mapping of categorical features, please refer to the code online (such as " < 12 ys" → "0", "12 ~ 40 ys" → "1", "40 ~ 65 ys" → "2", " > 65 ys" → "3"). www.nature.com/scientificreports/ 12 years, which was probably why most of the ML models (except the ensemble) failed to satisfactorily predict the outcome of this age group. In the future, more internal data and even external data may bring more benefits to establish generalizability and further increase the model fidelity. Second, our dataset had no information on post-arrest care and discharge disposition. Thus, it was impossible to systematically follow up and assess longterm recovery and survival of the discharged patients. Third, our study had a single-center retrospective design, and our dataset was abstracted from the electronic medical records by the researchers in this study, who had not been involved in the clinical treatment of the patients; therefore, the accuracy of the dataset was verified.
Finally, an ML model is not a "magic button, " although it would have reached a "super-human" performance. Like most ML approaches, the ML models validated in this study focused on predicting outcomes rather than on understanding causality, i.e., they found correlations but not causation. As an example, it was revealed in this study that two top predictors of risk for in-hospital mortality were CPR and epinephrine. The ensemble model predicted that longer CPR and higher dose of epinephrine were associated with a higher probability of death. In fact, the opposite was true; namely, patients (with severe ASA PS or massive hemorrhage) would be at a higher risk for serious complications and sequelae, even mortality, if insufficient CPR and/or epinephrine treatment were not timely delivered.
In clinical practice, accurate prediction models allow for improved medical prognostication, earlier identification of patients at high risk of complications, better risk adjustment and utilization of critical care resources, and more effective patient-physician-family communication.
In this study, the validated ensemble model provides superior prediction accuracy by virtue of high fidelity to data across various age groups and high robustness to uncertainty, as well as good discrimination between survivors and non-survivors. The data comprised operative parameters in addition to patients' demographic characteristics, which makes it possible to integrate operational optimization and/or tactical planning with the model by managing the operative parameters and procedure.  Features with a green bar favored survival, and those with a red bar were predictive of mortality. The x-axis shows how much each feature added to or subtracted from the final probability value for the patient. Each weight can be interpreted in the context of the original probability; if a feature is absent for a patient, it can be numerically added to or subtracted directly from the initial probability. www.nature.com/scientificreports/ One application scenario is early recognition of problems and suggestion of actions to avoid critical events. For an individual patient, an optimal combination of anesthesia management, surgical type, operative position (if optional), and treatment drugs could lead to a significantly improved probability of survival until hospital discharge (some exploratory simulations were done but not shown in this article). Another application scenario is that the model could enhance rational patient risk monitoring during operations, with drug doses administered in a timely fashion (target-controlled infusion), resulting in precision, efficacy, and safety of intravenous anesthesia delivery. In short, this model-based optimization opens an avenue for a personalized anesthesia and surgery strategy, with a better treatment and a higher survival rate attained.
Furthermore, clinicians hesitate to apply a black-box algorithm that is hard for them to understand and trust 33,34 . In this work, the explainers (LIME and SHAP) may pinpoint logics of decision-making and mitigate issue of clinical liability, encouraging clinicians to understand and leverage ML to assist decision-making and change management in practice.

Conclusion
The ensemble ML model makes solid predictions of mortality on the data of POCA patients' demographic and operative parameters, bringing a more comprehensive understanding of the risk factors and patient prognostics prior to hospital discharge, compared to the approach with ANOVA. Furthermore, the explainers of LIME and SHAP provide a more comprehensible and holistic approach to the assessment of prognosis of an individual patient. All of these results may assist risk management of in-hospital cardiac arrest with improved patientcentered and personalized care.